LeJEPA Implementation for Text

As a follow-up to my previous blog post, I want to implement a LeJEPA architecture adapted for text. The original LeJEPA paper focuses on vision tasks, using a latent prediction approach where the model learns to predict representations of masked image patches from visible context.

In this post, I’ll adapt these principles to text by implementing a transformer-based encoder that predicts latent representations of masked text spans. The key idea is to use span masking - masking contiguous chunks of text rather than random tokens - which encourages the model to learn higher-level semantic representations.
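To make the difference between span masking and random token masking concrete, here is a minimal, dependency-free sketch. The helper names (random_token_mask, contiguous_span_mask, count_runs) are hypothetical and only serve the illustration; they are not part of the implementation later in this post.

```python
import random

def random_token_mask(seq_len, num_masked, rng):
    """Mask `num_masked` positions chosen independently of each other."""
    masked = set(rng.sample(range(seq_len), num_masked))
    return [i in masked for i in range(seq_len)]

def contiguous_span_mask(seq_len, span_len, rng):
    """Mask one contiguous run of `span_len` positions."""
    start = rng.randint(0, seq_len - span_len)  # randint is inclusive on both ends
    return [start <= i < start + span_len for i in range(seq_len)]

def count_runs(mask):
    """Number of separate masked runs; 1 means a single contiguous span."""
    return sum(1 for i, m in enumerate(mask) if m and (i == 0 or not mask[i - 1]))

rng = random.Random(0)
tokens = "the quick brown fox jumps over the lazy dog near the river".split()
for name, mask in [
    ("random", random_token_mask(len(tokens), 4, rng)),
    ("span", contiguous_span_mask(len(tokens), 4, rng)),
]:
    shown = " ".join("[M]" if m else t for t, m in zip(tokens, mask))
    print(f"{name:>6}: {shown}  (masked runs: {count_runs(mask)})")
```

With random masking, each hidden word tends to be recoverable from its immediate neighbours. With a span, the neighbours are hidden too, so the model has to rely on the wider context.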

Approach

I’ll implement a LeJEPA-style architecture with:

An encoder that processes visible text tokens.
A predictor that learns to predict the target encoder’s representations of masked spans.

The goal is to train this model from scratch on a text dataset and evaluate how well the learned representations capture semantic information. This is primarily an implementation exercise to better understand how the LeJEPA architecture works and how it can be adapted beyond vision tasks.

For this we’ll need to create functions and classes for:

- A LeJEPAText class that defines the LeJEPA architecture.
- A LeJEPATextDataModule class that loads and preprocesses the text data.

And use:

- A text dataset, for which I’ll use a standard dataset from HuggingFace.
- A tokenizer, for which I’ll use a standard open source tokenizer.

Data

Investigating the Dataset

prompt / response

What would be a good dataset to use for this? Please search the web

🤖Reply🤖

For a text LeJEPA self-supervised learning experiment, here are some good dataset options:

Best for this experiment:

FineWeb / FineWeb-Edu - FineWeb includes 15T tokens of cleaned and deduplicated English web data. The dataset focuses on cleaner, more informative content that helps models learn language patterns, reasoning, and structure, which is why it has become a popular choice for pretraining and fine-tuning large language models.
WikiText-103 - A classic choice for self-supervised text experiments. Clean Wikipedia articles with ~100M tokens - manageable size for experimentation.
OpenWebText - Recreated WebText corpus, good for general text pretraining.

My recommendation for your experiment:

For a blog post / learning exercise, I’d suggest WikiText-103 (wikitext-103-raw-v1) because:

- Manageable size (~100M tokens)
- Clean, well-structured text
- Fast to download and iterate with
- Standard benchmark for language modeling

from datasets import load_dataset

# Quick to download, good for experimentation
dataset = load_dataset("wikitext", "wikitext-103-raw-v1")

If you want something larger later, you can use HuggingFaceFW/fineweb-edu with streaming:

# For larger scale (streaming avoids downloading the full dataset up front)
dataset = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)

Would you like me to help set up the DataModule for WikiText-103?

investigation

from datasets import load_dataset

dataset = load_dataset("wikitext", "wikitext-2-raw-v1")

dataset["train"][:5]

{'text': ['',
  ' = Valkyria Chronicles III = \n',
  '',
  ' Senjō no Valkyria 3 : Unrecorded Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . Employing the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs parallel to the first game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " Calamaty Raven " . \n',
  " The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underwent multiple adjustments , such as making the game more forgiving for series newcomers . Character designer Raita Honjou and composer Hitoshi Sakimoto both returned from previous entries , along with Valkyria Chronicles II director Takeshi Ozawa . A large team of writers handled the script . The game 's opening theme was sung by May 'n . \n"]}

prompt / response

Hmmm… it looks like in this dataset the text is split between documents. How can I deal with this in a dataset for training a text JEPA?

🤖Reply🤖

Great observation! The standard approach is to concatenate all documents then chunk into fixed-length sequences. This avoids short sequences and wasted padding.

from transformers import AutoTokenizer
from torch.utils.data import Dataset

class ChunkedTextDataset(Dataset):
    def __init__(self, dataset, tokenizer, max_length=128):
        # Concatenate all non-empty text
        all_text = " ".join([t for t in dataset["text"] if t.strip()])
        
        # Tokenize everything at once
        tokens = tokenizer(all_text, return_tensors="pt", add_special_tokens=False)["input_ids"].squeeze()
        
        # Chunk into fixed-length sequences (drop remainder)
        num_chunks = len(tokens) // max_length
        self.chunks = tokens[:num_chunks * max_length].view(num_chunks, max_length)
    
    def __len__(self):
        return len(self.chunks)
    
    def __getitem__(self, idx):
        return {"input_ids": self.chunks[idx]}

# Example usage
tokenizer = AutoTokenizer.from_pretrained("gpt2")
train_dataset = ChunkedTextDataset(dataset["train"], tokenizer, max_length=128)

print(f"Number of chunks: {len(train_dataset)}")
print(f"Sample shape: {train_dataset[0]['input_ids'].shape}")

Key points:

- Concatenate: Join all documents with spaces (ignoring empty rows)
- Tokenize once: More efficient than per-document tokenization
- Fixed chunks: Every sample is exactly max_length tokens, so no padding is needed
- Drop remainder: Discard the last partial chunk for simplicity

This is the same approach used by most LLM pretraining pipelines. You can also add document separator tokens if you want the model to learn document boundaries.
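As a rough illustration of the separator idea, the sketch below works on plain lists of token ids so it runs without a tokenizer. The helper names concat_with_separator and chunk_fixed_length are hypothetical, and SEP_ID = 50256 assumes GPT-2’s end-of-text id; in real code it would be safer to read tokenizer.eos_token_id.

```python
def concat_with_separator(docs_token_ids, sep_id):
    """Flatten per-document token id lists, appending `sep_id` after each document."""
    flat = []
    for ids in docs_token_ids:
        flat.extend(ids)
        flat.append(sep_id)
    return flat

def chunk_fixed_length(token_ids, max_length):
    """Split into chunks of exactly `max_length` tokens, dropping the remainder."""
    num_chunks = len(token_ids) // max_length
    return [token_ids[i * max_length:(i + 1) * max_length] for i in range(num_chunks)]

SEP_ID = 50256  # assumption: GPT-2's <|endoftext|> id
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]  # toy token ids for three documents
flat = concat_with_separator(docs, SEP_ID)
chunks = chunk_fixed_length(flat, max_length=4)
print(flat)
print(chunks)
```

Note that a chunk can still straddle two documents; the separator only makes the boundary visible to the model.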

investigation

from transformers import AutoTokenizer
from torch.utils.data import Dataset
from tqdm import tqdm

class ChunkedTextDataset(Dataset):
    def __init__(self, dataset, tokenizer, max_length=128):
        # Concatenate all non-empty text
        all_text = " ".join([t for t in tqdm(dataset["text"]) if t.strip()])
        
        # Tokenize everything at once
        tokens = tokenizer(all_text, return_tensors="pt", add_special_tokens=False)["input_ids"].squeeze()
        
        # Chunk into fixed-length sequences (drop remainder)
        num_chunks = len(tokens) // max_length
        self.chunks = tokens[:num_chunks * max_length].view(num_chunks, max_length)
    
    def __len__(self):
        return len(self.chunks)
    
    def __getitem__(self, idx):
        return {"input_ids": self.chunks[idx]}

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Concatenate all non-empty text
all_text = " ".join([t for t in tqdm(dataset["train"]["text"]) if t.strip()])

100%|██████████| 36718/36718 [00:00<00:00, 113158.42it/s]

# Tokenize everything at once
tokens = tokenizer(all_text[:1000000], return_tensors="pt", add_special_tokens=False)["input_ids"].squeeze()

Token indices sequence length is longer than the specified maximum sequence length for this model (219671 > 1024). Running this sequence through the model will result in indexing errors

len(all_text), len(tokens)

(10916756, 219671)
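Since only the first 1,000,000 characters were tokenized, the two numbers above are not directly comparable. A quick back-of-the-envelope estimate, assuming the characters-per-token ratio of that prefix is representative of the whole split:

```python
total_chars = 10_916_756   # len(all_text) from the output above
sample_chars = 1_000_000   # only all_text[:1000000] was tokenized
sample_tokens = 219_671    # len(tokens) from the output above

chars_per_token = sample_chars / sample_tokens
estimated_total_tokens = round(total_chars / chars_per_token)

print(f"chars per token: {chars_per_token:.2f}")
print(f"estimated tokens in full split: {estimated_total_tokens:,}")
```

This lands at roughly 2.4M tokens, which is in line with the 9,343 chunks of 256 tokens (about 2.39M tokens) reported further down.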

prompt / response

Can you change the tokenizer above to run in a batched way? Right now it appears to be crashing on the whole dataset.

🤖Reply🤖

from transformers import AutoTokenizer
from torch.utils.data import Dataset
from tqdm import tqdm
import torch

class ChunkedTextDataset(Dataset):
    def __init__(self, dataset, tokenizer, max_length=128, batch_size=10000):
        # Filter non-empty text
        texts = [t for t in dataset["text"] if t.strip()]
        
        # Tokenize in batches
        all_tokens = []
        for i in tqdm(range(0, len(texts), batch_size), desc="Tokenizing"):
            batch = texts[i:i + batch_size]
            encoded = tokenizer(batch, add_special_tokens=False)["input_ids"]
            for ids in encoded:
                all_tokens.extend(ids)
        
        tokens = torch.tensor(all_tokens)
        
        # Chunk into fixed-length sequences
        num_chunks = len(tokens) // max_length
        self.chunks = tokens[:num_chunks * max_length].view(num_chunks, max_length)
    
    def __len__(self):
        return len(self.chunks)
    
    def __getitem__(self, idx):
        return {"input_ids": self.chunks[idx]}

# Usage
tokenizer = AutoTokenizer.from_pretrained("gpt2")
train_dataset = ChunkedTextDataset(dataset["train"], tokenizer, max_length=128)
print(f"Number of chunks: {len(train_dataset)}")

Key change: Instead of joining all text into one giant string, we tokenize documents in batches of 10,000 and concatenate the token IDs. This avoids the memory spike from both the huge concatenated string and the single tokenizer call.

investigation

from transformers import AutoTokenizer
from torch.utils.data import Dataset
from tqdm import tqdm
import torch

class ChunkedTextDataset(Dataset):
    def __init__(self, dataset, tokenizer, max_length=128, batch_size=10000):
        # Filter non-empty text
        texts = [t for t in dataset["text"] if t.strip()]
        
        # Tokenize in batches
        all_tokens = []
        for i in tqdm(range(0, len(texts), batch_size), desc="Tokenizing"):
            batch = texts[i:i + batch_size]
            encoded = tokenizer(batch, add_special_tokens=False)["input_ids"]
            for ids in encoded:
                all_tokens.extend(ids)
        
        tokens = torch.tensor(all_tokens)
        
        # Chunk into fixed-length sequences
        num_chunks = len(tokens) // max_length
        self.chunks = tokens[:num_chunks * max_length].view(num_chunks, max_length)
    
    def __len__(self):
        return len(self.chunks)
    
    def __getitem__(self, idx):
        return self.chunks[idx]

# Usage
tokenizer = AutoTokenizer.from_pretrained("gpt2")
train_dataset = ChunkedTextDataset(dataset["train"], tokenizer, max_length=256)
print(f"Number of chunks: {len(train_dataset)}")

Tokenizing: 100%|██████████| 3/3 [00:00<00:00,  3.55it/s]

Number of chunks: 9343

train_dataset[0]

tensor([  796,   569, 18354,  7496, 17740,  6711,   796,   220,   198,  2311,
           73, 13090,   645,   569, 18354,  7496,   513,  1058,   791, 47398,
        17740,   357,  4960,  1058, 10545,   230,    99,   161,   254,   112,
         5641, 44444,  9202, 25084, 24440, 12675, 11839,    18,   837,  6578,
          764,   569, 18354,  7496,   286,   262, 30193,   513,  1267,   837,
         8811,  6412,   284,   355,   569, 18354,  7496, 17740,  6711,  2354,
         2869,   837,   318,   257, 16106,  2597,  2488,    12,    31,  2712,
         2008,   983,  4166,   416, 29490,   290,  6343,    13, 44206,   329,
          262, 14047, 44685,   764, 28728,   287,  3269,  2813,   287,  2869,
          837,   340,   318,   262,  2368,   983,   287,   262,   569, 18354,
         7496,  2168,   764, 12645,   278,   262,   976, 21748,   286, 16106,
          290,  1103,  2488,    12,    31,   640, 11327,   355,   663, 27677,
          837,   262,  1621,  4539, 10730,   284,   262,   717,   983,   290,
         5679,   262,   366, 17871,  5321,   366,   837,   257, 23634,  2422,
         4326,  7351,   262,  3277,   286,  7096,   544,  1141,   262,  5498,
         1898,  6839,  1810,   508,  1620,  3200,  2042,  4560,   290,   389,
        46852,  1028,   262, 11773,  4326,   366,  2199,   321,   265,    88,
        12552,   366,   764,   220,   198,   383,   983,  2540,  2478,   287,
         3050,   837,  6872,   625,   257,  1588,  6903,   286,   262,   670,
         1760,   319,   569, 18354,  7496, 17740,  2873,   764,  2893,   340,
        17383,   262,  3210,  3033,   286,   262,  2168,   837,   340,   635,
        25289,  3294, 16895,   837,   884,   355,  1642,   262,   983,   517,
        43486,   329,  2168, 29661,   764, 15684, 11915,   371,  4548,    64,
         8835,    73,   280,   290, 26777,  7286, 13704, 13231, 43354,  1111,
         4504,   422,  2180, 12784,   837,  1863,   351,   569, 18354,  7496,
        17740,  2873,  3437, 33687,  5303, 18024])

Data Loading and Preprocessing

Okay so I’ve got a dataset and a tokenizer. However, I think we need some further changes to the dataset, so that we can use it for training. To apply JEPA here, we need a “source” batch of tokens and a “target” batch of tokens, where we train the joint representations to be equivalent.

I’m thinking a good way to do this is to split the text in the dataset’s __getitem__ method. This will probably need a min_split_size and max_split_size to avoid splits being overly large or small.

prompt / response

Is the approach that I’m suggesting above sensible? Is there anything I might be missing?

One thing I’m wondering is if having different splits as inputs to the model will have an impact on positional encoding.

🤖Reply🤖

Your intuition about positional encoding is spot on - this is a critical consideration!

The issue: If you just split a sequence and feed source/target as separate sequences starting from position 0, you lose the positional relationship. The model won’t know where in the original sequence the target spans were.

How JEPA handles this (vision analogy):

- Context encoder: visible patches with original position embeddings
- Target encoder: masked patches with original position embeddings
- Predictor: takes context embeddings + target positions → predicts target representations

For text, I’d suggest this structure:

def __getitem__(self, idx):
    tokens = self.chunks[idx]  # Full sequence [seq_len]
    
    # Generate span mask (True = masked/target)
    mask = self.generate_span_mask(len(tokens))
    
    return {
        "input_ids": tokens,           # Full sequence
        "target_mask": mask,           # Which positions to predict
        "positions": torch.arange(len(tokens))  # Explicit positions
    }

Then in your model:

- Context encoder: processes input_ids[~target_mask] with positions[~target_mask]
- Target encoder: processes input_ids[target_mask] with positions[target_mask]
- Predictor: uses context output + positions[target_mask] to predict target representations

Key insight: The predictor needs to receive the target positions as input (often via learned position queries or just concatenated position embeddings) so it knows where to predict.
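A small, dependency-free sketch of what “keeping the original positions” means in practice. split_by_mask is a hypothetical helper used only for this illustration; the token ids are the first few from the sample chunk shown earlier.

```python
def split_by_mask(token_ids, target_mask):
    """Return (context, target), each a list of (original_position, token_id) pairs."""
    context, target = [], []
    for pos, (tok, is_target) in enumerate(zip(token_ids, target_mask)):
        (target if is_target else context).append((pos, tok))
    return context, target

token_ids = [796, 569, 18354, 7496, 17740, 6711]
target_mask = [False, False, False, False, True, True]

context, target = split_by_mask(token_ids, target_mask)
context_positions = [p for p, _ in context]
target_positions = [p for p, _ in target]
print(context_positions)
print(target_positions)
```

The target positions continue from where the context stops rather than restarting at 0, which is the information the predictor needs. One thing to keep in mind is that the split index differs per sample, so the context and target lengths will vary within a batch.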

Does this align with how you were thinking about it, or would you like me to sketch out the span masking logic?

prompt / response

Yeah, that makes sense. That being said I don’t think we need the positional key here, as that can be generated in the model.

Can you start by sketching out the generate_span_mask logic? I think I want this so that it splits into 2 spans at a random index, with a min length to avoid splits that are too small for the model to learn.

Try and make this minimal… I essentially just want a boolean mask based on a random index between min_length and seq_len - min_length

🤖Reply🤖

import torch
import random

def generate_span_mask(seq_len, min_length=32):
    """Split sequence into context (False) and target (True) at random index."""
    split_idx = random.randint(min_length, seq_len - min_length)
    mask = torch.zeros(seq_len, dtype=torch.bool)
    mask[split_idx:] = True  # Tokens from split_idx onward are the target
    return mask

You could also randomly flip which side is the target:

def generate_span_mask(seq_len, min_length=32):
    split_idx = random.randint(min_length, seq_len - min_length)
    mask = torch.zeros(seq_len, dtype=torch.bool)
    mask[split_idx:] = True
    if random.random() > 0.5:
        mask = ~mask  # Flip so the first part is the target
    return mask
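One edge case worth noting: if seq_len is smaller than 2 * min_length, random.randint receives an empty range and raises a fairly cryptic ValueError. Below is a hedged, pure-Python variant (returning a list of booleans instead of a tensor) that checks this up front and accepts a seeded generator for reproducibility. The name generate_span_mask_checked and the rng parameter are my own additions for illustration.

```python
import random

def generate_span_mask_checked(seq_len, min_length=32, rng=random):
    """Boolean list: False = context, True = target, split at a random index.

    Hypothetical variant of generate_span_mask that validates its inputs.
    """
    if seq_len < 2 * min_length:
        raise ValueError(
            f"seq_len={seq_len} is too short for two spans of at least {min_length} tokens"
        )
    split_idx = rng.randint(min_length, seq_len - min_length)  # inclusive bounds
    return [i >= split_idx for i in range(seq_len)]

mask = generate_span_mask_checked(256, min_length=32, rng=random.Random(0))
print(f"context tokens: {len(mask) - sum(mask)}, target tokens: {sum(mask)}")
```

Both sides are guaranteed to contain at least min_length tokens, because the split index is drawn from [min_length, seq_len - min_length].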

prompt / response

I like version one of this. Can you now integrate this into the dataset logic?

🤖Reply🤖

import torch
import random
from torch.utils.data import Dataset
from tqdm import tqdm

class ChunkedTextDataset(Dataset):
    def __init__(self, dataset, tokenizer, max_length=256, batch_size=10000, min_span_length=32):
        self.min_span_length = min_span_length
        
        # Filter non-empty text
        texts = [t for t in dataset["text"] if t.strip()]
        
        # Tokenize in batches
        all_tokens = []
        for i in tqdm(range(0, len(texts), batch_size), desc="Tokenizing"):
            batch = texts[i:i + batch_size]
            encoded = tokenizer(batch, add_special_tokens=False)["input_ids"]
            for ids in encoded:
                all_tokens.extend(ids)
        
        tokens = torch.tensor(all_tokens)
        
        # Chunk into fixed-length sequences
        num_chunks = len(tokens) // max_length
        self.chunks = tokens[:num_chunks * max_length].view(num_chunks, max_length)
    
    def generate_span_mask(self, seq_len):
        split_idx = random.randint(self.min_span_length, seq_len - self.min_span_length)
        mask = torch.zeros(seq_len, dtype=torch.bool)
        mask[split_idx:] = True
        return mask
    
    def __len__(self):
        return len(self.chunks)
    
    def __getitem__(self, idx):
        tokens = self.chunks[idx]
        mask = self.generate_span_mask(len(tokens))
        return {
            "input_ids": tokens,
            "target_mask": mask,
        }

# Usage
tokenizer = AutoTokenizer.from_pretrained("gpt2")
train_dataset = ChunkedTextDataset(dataset["train"], tokenizer, max_length=256)
print(f"Number of chunks: {len(train_dataset)}")
print(f"Sample:", train_dataset[0])

Tokenizing: 100%|██████████| 3/3 [00:01<00:00,  2.56it/s]

Number of chunks: 9343
Sample: {'input_ids': tensor([  796,   569, 18354,  7496, 17740,  6711,   796,   220,   198,  2311,
           73, 13090,   645,   569, 18354,  7496,   513,  1058,   791, 47398,
        17740,   357,  4960,  1058, 10545,   230,    99,   161,   254,   112,
         5641, 44444,  9202, 25084, 24440, 12675, 11839,    18,   837,  6578,
          764,   569, 18354,  7496,   286,   262, 30193,   513,  1267,   837,
         8811,  6412,   284,   355,   569, 18354,  7496, 17740,  6711,  2354,
         2869,   837,   318,   257, 16106,  2597,  2488,    12,    31,  2712,
         2008,   983,  4166,   416, 29490,   290,  6343,    13, 44206,   329,
          262, 14047, 44685,   764, 28728,   287,  3269,  2813,   287,  2869,
          837,   340,   318,   262,  2368,   983,   287,   262,   569, 18354,
         7496,  2168,   764, 12645,   278,   262,   976, 21748,   286, 16106,
          290,  1103,  2488,    12,    31,   640, 11327,   355,   663, 27677,
          837,   262,  1621,  4539, 10730,   284,   262,   717,   983,   290,
         5679,   262,   366, 17871,  5321,   366,   837,   257, 23634,  2422,
         4326,  7351,   262,  3277,   286,  7096,   544,  1141,   262,  5498,
         1898,  6839,  1810,   508,  1620,  3200,  2042,  4560,   290,   389,
        46852,  1028,   262, 11773,  4326,   366,  2199,   321,   265,    88,
        12552,   366,   764,   220,   198,   383,   983,  2540,  2478,   287,
         3050,   837,  6872,   625,   257,  1588,  6903,   286,   262,   670,
         1760,   319,   569, 18354,  7496, 17740,  2873,   764,  2893,   340,
        17383,   262,  3210,  3033,   286,   262,  2168,   837,   340,   635,
        25289,  3294, 16895,   837,   884,   355,  1642,   262,   983,   517,
        43486,   329,  2168, 29661,   764, 15684, 11915,   371,  4548,    64,
         8835,    73,   280,   290, 26777,  7286, 13704, 13231, 43354,  1111,
         4504,   422,  2180, 12784,   837,  1863,   351,   569, 18354,  7496,
        17740,  2873,  3437, 33687,  5303, 18024]), 'target_mask': tensor([False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True])}

prompt / response

I’m now happy with how this is splitting and chunking.

Can you create a pytorch lightning data module for processing this data?

🤖Reply🤖

import pytorch_lightning as pl
from torch.utils.data import DataLoader
from datasets import load_dataset
from transformers import AutoTokenizer

class LeJEPATextDataModule(pl.LightningDataModule):
    def __init__(
        self,
        tokenizer_name="gpt2",
        max_length=256,
        min_span_length=32,
        batch_size=32,
        num_workers=0,
    ):
        super().__init__()
        self.tokenizer_name = tokenizer_name
        self.max_length = max_length
        self.min_span_length = min_span_length
        self.batch_size = batch_size
        self.num_workers = num_workers
        
    def setup(self, stage=None):
        tokenizer = AutoTokenizer.from_pretrained(self.tokenizer_name)
        dataset = load_dataset("wikitext", "wikitext-2-raw-v1")
        
        self.train_dataset = ChunkedTextDataset(
            dataset["train"], tokenizer, self.max_length, min_span_length=self.min_span_length
        )
        self.val_dataset = ChunkedTextDataset(
            dataset["validation"], tokenizer, self.max_length, min_span_length=self.min_span_length
        )
    
    def train_dataloader(self):
        return DataLoader(
            self.train_dataset,
            batch_size=self.batch_size,
            shuffle=True,
            num_workers=self.num_workers,
        )
    
    def val_dataloader(self):
        return DataLoader(
            self.val_dataset,
            batch_size=self.batch_size,
            shuffle=False,
            num_workers=self.num_workers,
        )

Usage:

dm = LeJEPATextDataModule(batch_size=32)
dm.setup()
print(f"Train batches: {len(dm.train_dataloader())}")
print(f"Val batches: {len(dm.val_dataloader())}")

# Check a batch
batch = next(iter(dm.train_dataloader()))
print(f"input_ids shape: {batch['input_ids'].shape}")
print(f"target_mask shape: {batch['target_mask'].shape}")

Tokenizing: 100%|██████████| 3/3 [00:00<00:00,  3.26it/s]
Tokenizing: 100%|██████████| 1/1 [00:00<00:00, 15.44it/s]

Train batches: 292
Val batches: 31
input_ids shape: torch.Size([32, 256])
target_mask shape: torch.Size([32, 256])
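As a quick sanity check, the batch counts above can be reproduced from the chunk count with a little arithmetic, assuming the DataLoader keeps the final partial batch (drop_last is left at its default of False in the data module).

```python
import math

train_chunks = 9343      # "Number of chunks" printed earlier for the train split
batch_size = 32

train_batches = math.ceil(train_chunks / batch_size)
last_batch_size = train_chunks - (train_batches - 1) * batch_size
print(train_batches, last_batch_size)

# 31 validation batches imply between 30 * 32 + 1 and 31 * 32 validation chunks.
val_batches = 31
val_chunk_range = ((val_batches - 1) * batch_size + 1, val_batches * batch_size)
print(val_chunk_range)
```

So the last training batch holds 31 samples rather than 32, which matters for any loss or statistic that depends on the batch size.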

Model

Next, we need to define the model. There are a few things to work out here:

- How to apply SIGReg.
- Defining an encoder that can encode both the source and target spans.
- Defining a predictor that can predict the target encoder’s representations of masked spans.
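Before looking at the reference code, here is a rough pure-Python illustration of the idea behind SIGReg as I understand it: project the embeddings onto random unit directions, then penalise how far each 1D projection’s empirical characteristic function is from that of a standard Gaussian (an Epps-Pulley style statistic). This is a sketch under my own assumptions, not the repository’s implementation. The function names, the 17 integration knots, the [0, 3] range and the 16 directions are illustrative choices.

```python
import math
import random

def epps_pulley_1d(values, num_knots=17, t_max=3.0):
    """Weighted squared gap between the empirical characteristic function of
    `values` and that of N(0, 1), integrated over [0, t_max] (trapezoid rule)."""
    n = len(values)
    dt = t_max / (num_knots - 1)
    total = 0.0
    for k in range(num_knots):
        t = k * dt
        real = sum(math.cos(t * v) for v in values) / n
        imag = sum(math.sin(t * v) for v in values) / n
        target = math.exp(-t * t / 2)          # CF of N(0, 1); imaginary part is 0
        gap = (real - target) ** 2 + imag ** 2
        trapezoid = dt / 2 if k in (0, num_knots - 1) else dt
        total += trapezoid * target * gap      # Gaussian window used as the weight
    return total

def sketched_gaussianity_penalty(embeddings, num_directions=16, rng=random):
    """Average the 1D statistic over random unit directions."""
    dim = len(embeddings[0])
    total = 0.0
    for _ in range(num_directions):
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(d * d for d in direction))
        projected = [sum(e * d for e, d in zip(row, direction)) / norm for row in embeddings]
        total += epps_pulley_1d(projected)
    return total / num_directions

rng = random.Random(0)
gaussian = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(512)]
collapsed = [[0.0] * 8 for _ in range(512)]   # every embedding identical
gaussian_penalty = sketched_gaussianity_penalty(gaussian, rng=rng)
collapsed_penalty = sketched_gaussianity_penalty(collapsed, rng=rng)
print(gaussian_penalty, collapsed_penalty)
```

The penalty should come out much larger for the collapsed embeddings than for the Gaussian ones, which is the behaviour we want from a regulariser meant to prevent representation collapse. A real implementation would be vectorised in PyTorch and differentiable.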

SIGReg

prompt / response

To start, I want to work out how to implement SIGReg. Can you implement this, referencing both the paper and this implementation on GitHub? If it’s easier, I’m happy for you to use the GitHub implementation directly.

🤖Reply🤖

read_url(url=https://github.com/galilai-group/lejepa)

{
  "id": "toolu_01JvPDFMtq6kKnbg8ei3JZ1r",
  "call": {
    "function": "read_url",
    "arguments": {
      "url": "https://github.com/galilai-group/lejepa"
    }
  },
  "result": "'# galilai-group/lejepa\\n\\n\\n## / Files\\n- LICENSE\\n- MINIMAL.md\\n- README.md\\n- \ud83d\udcc1 eval\\n- \ud83d\udcc1 figures\\n- \ud83d\udcc1 lejepa\\n- pyproject.toml\\n- \ud83d\udcc1 scripts\\n- setup.py\\n- \ud83d\udcc1 tests\\n\\n## README\\n# LeJEPA\\n**Lean Joint-Embedding Predictive Architecture (LeJEPA): Provable and Scalable Self-Supervised Learning Without the Heuristics**\\n[GitHub Repository](https://github.com/rbalestr-lab/lejepa)  \\n[arXiv:2511.08544](https://arxiv.org/abs/2511.08544)\\n---\\n\\nRush to our [minimal working example](MINIMAL.md) to see a full-fledge working example (ViT, inet).\\n\\n## Demo\\n\\n<img src=\"eval/output1.gif\" controls widt<TRUNCATED>"
}

read_url(url=https://github.com/galilai-group/lejepa/tree/main/…)

{
  "id": "toolu_01QMF2DitEf7Q3dKfLskhjh6",
  "call": {
    "function": "read_url",
    "arguments": {
      "url": "https://github.com/galilai-group/lejepa/tree/main/lejepa"
    }
  },
  "result": "'# galilai-group/lejepa\\n\\n\\n## /lejepa Files\\n- __init__.py\\n- \ud83d\udcc1 __pycache__\\n- \ud83d\udcc1 multivariate\\n- \ud83d\udcc1 univariate'"
}